{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NPUGFpKP1_BH"
   },
   "source": [
    "# Demonstrating the `AgentEval` framework using the task of solving math problems as an example\n",
    "\n",
    "This notebook aims to demonstrate how `AgentEval` works in an offline scenario, where we use a math problem-solving task as an example. \n",
    "`AgentEval` consists of two key steps:\n",
    "\n",
    "- `generate_criteria`: This is an LLM-based function that generates a list of criteria $(c_1, \\dots, c_n)$ to help to evaluate a utility given task.\n",
    "\n",
    "- `quantify_criteria`: This function quantifies the performance of any sample task based on the criteria generated in the `generate_criteria` step in the following way: $(c_1=a_1, \\dots, c_n=a_n)$\n",
    "\n",
    "![AgentEval](https://raw.githubusercontent.com/ag2ai/ag2/refs/heads/main/website/docs/_blogs/2023-11-20-AgentEval/img/agenteval-CQ.webp)\n",
    "\n",
    "For more detailed explanations, please refer to the accompanying [blog post](https://docs.ag2.ai/latest/docs/blog/2023/11/20/AgentEval)\n",
    "\n",
    "## Requirements\n",
    "\n",
    "AG2 requires `Python>=3.9`. To run this notebook example, please install ag2, Docker, and OpenAI:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "68lTZZyJ1_BI",
    "outputId": "15a55fab-e13a-4654-b8cb-ae117478d6d8"
   },
   "outputs": [],
   "source": [
    "%pip install \"ag2>=0.2.3\" docker\n",
    "%pip install scipy\n",
    "%pip install matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HxgqKJrd1_BJ"
   },
   "source": [
    "## Set your API Endpoint\n",
    "* The [`config_list_from_json`](https://docs.ag2.ai/latest/docs/api-reference/autogen/config_list_from_json/#autogen.config_list_from_json) function loads a list of configurations from an environment variable or a json file. It first looks for an environment variable with a specified name. The value of the environment variable needs to be a valid json string. If that variable is not found, it looks for a json file with the same name. It filters the configs by filter_dict.\n",
    "\n",
    "You can set the value of config_list in any way you prefer. Please refer to this [User Guide](https://docs.ag2.ai/latest/docs/user-guide/basic-concepts/llm-configuration/) for full code examples of the different methods.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YRycFEDJ1_BJ"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "from contextlib import suppress\n",
    "from pathlib import Path\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import scipy.stats as stats\n",
    "\n",
    "import autogen\n",
    "from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria, quantify_criteria\n",
    "from autogen.agentchat.contrib.agent_eval.criterion import Criterion\n",
    "from autogen.agentchat.contrib.agent_eval.task import Task\n",
    "\n",
    "config_list = autogen.config_list_from_json(\"OAI_CONFIG_LIST\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6vPTtNkhk2V1"
   },
   "source": [
    "# Run the Critic\n",
    "\n",
    "To run the critic, we need a couple of math problem examples. One of them failed to solve the problem successfully, given in `agenteval-in-out/response_failed.txt`, and the other one was solved successfully, i.e., `agenteval-in-out/response_successful.txt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5H1WRs_wkiK0"
   },
   "outputs": [],
   "source": [
    "def remove_ground_truth(test_case):\n",
    "    test_details = json.loads(test_case)\n",
    "    # need to remove the ground truth from the test details\n",
    "    correctness = test_details.pop(\"is_correct\", None)\n",
    "    test_details.pop(\"correct_ans\", None)\n",
    "    test_details.pop(\"check_result\", None)\n",
    "    return str(test_details), correctness\n",
    "\n",
    "\n",
    "# Reading one successful and one failed example of the task\n",
    "success_str = open(\"../test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt\").read()  # noqa: SIM115\n",
    "response_successful = remove_ground_truth(success_str)[0]\n",
    "failed_str = open(\"../test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt\").read()  # noqa: SIM115\n",
    "response_failed = remove_ground_truth(failed_str)[0]\n",
    "\n",
    "task = Task(**{\n",
    "    \"name\": \"Math problem solving\",\n",
    "    \"description\": \"Given any question, the system needs to solve the problem as consisely and accurately as possible\",\n",
    "    \"successful_response\": response_successful,\n",
    "    \"failed_response\": response_failed,\n",
    "})\n",
    "\n",
    "criteria = generate_criteria(task=task, llm_config={\"config_list\": config_list}, max_round=8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Vu70o024lenI"
   },
   "source": [
    "# The Criteria\n",
    "Now, we print the designed criteria for assessing math problems. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "k9DsDB5hqvtG",
    "outputId": "0edd7a0c-b031-4f67-efc6-1a1e77066921"
   },
   "outputs": [],
   "source": [
    "current_task_name = \"_\".join(task.name.split()).lower()\n",
    "cr_file = open(f\"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json\", \"w\")  # noqa: SIM115\n",
    "cr_file.write(Criterion.write_json(criteria))\n",
    "cr_file.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PETPZluOEGCR"
   },
   "source": [
    "*Note :* You can also define and use your own criteria in order to feed into the quantifier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SmpUZv_ylo9U"
   },
   "source": [
    "# The `QuantifierAgent`\n",
    "\n",
    "Once we have the criteria, we need to quantify a new sample based on the designed criteria and its accepted values. This will be done through `quantify_criteria` from agent_eval. \n",
    "Again, you can use your own defined criteria in `criteria_file`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4uUkZJh_subA"
   },
   "outputs": [],
   "source": [
    "criteria_file = f\"../test/test_files/agenteval-in-out/{current_task_name}_criteria.json\"\n",
    "criteria = open(criteria_file).read()  # noqa: SIM115\n",
    "criteria = Criterion.parse_json_str(criteria)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "64rRJfB2l6lO"
   },
   "source": [
    "## Running the quantifier on a single test case"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we run the quantifier on a single math problem test case, `sample_test_case.json`, for demonstration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Pf623aNbHZTG",
    "outputId": "0031871b-a438-43f5-d2b2-c99fa1ad0dbd"
   },
   "outputs": [],
   "source": [
    "test_case = open(\"../test/test_files/agenteval-in-out/samples/sample_test_case.json\").read()  # noqa: SIM115\n",
    "test_case, ground_truth = remove_ground_truth(test_case)\n",
    "quantifier_output = quantify_criteria(\n",
    "    llm_config={\"config_list\": config_list},\n",
    "    criteria=criteria,\n",
    "    task=task,\n",
    "    test_case=test_case,\n",
    "    ground_truth=ground_truth,\n",
    ")\n",
    "print(\"actual correctness:\", quantifier_output[\"actual_success\"])\n",
    "print(\"predicted correctness:\\n\", quantifier_output[\"estimated_performance\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2VtdM44WEGCS"
   },
   "source": [
    "# Run `AgentEval` on the logs\n",
    "\n",
    "In the example below, log_path points to the sample logs folder to run the quantifier. The current sample belongs to the prealgebra category which will be downloaded from [here](https://github.com/julianakiseleva/autogen/tree/agenteval/test/test_files/agenteval-in-out/samples).\n",
    "In case you want to replicate the results described in the blog post, you can download all the logs for math problems using the following [link](https://github.com/julianakiseleva/autogen/tree/3ac137d52a20a013a39b48d0fea8a8b720116c66/model-logs/math-problems/agentchat). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can set your own log path - we also limited the number of samples to avoid additional costs.\n",
    "# By removing the condition about limitations on the number of samples per category, you can run it on all 120 problems\n",
    "\n",
    "log_path = \"../test/test_files/agenteval-in-out/agentchat_results/\"\n",
    "\n",
    "# The file is no longer in the repo, we can download it from an older commit\n",
    "!wget https://github.com/julianakiseleva/autogen/raw/ddabd4f0e7c13a50e33cf8462e79358666371477/test/test_files/agenteval-in-out/prealgebra.zip\n",
    "!unzip -o prealgebra.zip -d {log_path}\n",
    "!rm  prealgebra.zip\n",
    "\n",
    "assert Path(log_path).exists(), f\"The log path '{log_path}' does not exist.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "dZdIbHPFEGCS",
    "outputId": "83c0a51b-f184-494b-81a0-d4b4a3667319"
   },
   "outputs": [],
   "source": [
    "criteria_file = \"../test/test_files/agenteval-in-out/samples/sample_math_criteria.json\"\n",
    "criteria = Criterion.parse_json_str(open(criteria_file).read())  # noqa: SIM115\n",
    "outcome = {}\n",
    "\n",
    "for prefix in os.listdir(log_path):\n",
    "    for file_name in os.listdir(log_path + \"/\" + prefix):\n",
    "        gameid = prefix + \"_\" + file_name\n",
    "        if file_name.split(\".\")[-1] == \"json\":\n",
    "            test_case, ground_truth = remove_ground_truth(open(log_path + \"/\" + prefix + \"/\" + file_name).read())  # noqa: SIM115\n",
    "            quantifier_output = quantify_criteria(\n",
    "                llm_config={\"config_list\": config_list},\n",
    "                criteria=criteria,\n",
    "                task=task,\n",
    "                test_case=test_case,\n",
    "                ground_truth=ground_truth,\n",
    "            )\n",
    "            outcome[gameid] = quantifier_output\n",
    "\n",
    "# store the evaluated problems\n",
    "with open(\"../test/test_files/agenteval-in-out/evaluated_problems.json\", \"w\") as file:\n",
    "    json.dump(outcome, file, indent=2)  # use `json.loads` to do the reverse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qbrRRiP_EGCT"
   },
   "source": [
    "## Plotting the estimated performance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here you can find an example of how to visualize the obtained result in the histogram form (similar to the one in the blog post)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "LKu2xZJcEGCT",
    "outputId": "7780bc7c-382f-4ad3-b8c6-ac6051302303"
   },
   "outputs": [],
   "source": [
    "# computing average and 95% interval for failed and successful cases on all criteria\n",
    "\n",
    "\n",
    "with suppress(Exception):\n",
    "    criteria = Criterion.parse_json_str(open(criteria_file).read())  # noqa: SIM115\n",
    "\n",
    "\n",
    "nl2int = {}\n",
    "for criterion in criteria:\n",
    "    for score, v in enumerate(criterion.accepted_values):\n",
    "        nl2int[v] = score\n",
    "print(nl2int)\n",
    "\n",
    "average_s = {}\n",
    "average_f = {}\n",
    "\n",
    "conf_interval_s = {}\n",
    "conf_interval_f = {}\n",
    "\n",
    "for criterion in criteria:\n",
    "    task = {\"s\": [], \"f\": []}\n",
    "\n",
    "    for game in outcome:\n",
    "        with suppress(Exception):\n",
    "            tmp_dic = eval(outcome[game][\"estimated_performance\"])\n",
    "            if outcome[game][\"actual_success\"] == \"false\":\n",
    "                task[\"f\"].append(nl2int[tmp_dic[criterion.name]])\n",
    "            else:\n",
    "                task[\"s\"].append(nl2int[tmp_dic[criterion.name]])\n",
    "\n",
    "    average_f[criterion.name] = np.mean(task[\"f\"])\n",
    "    average_s[criterion.name] = np.mean(task[\"s\"])\n",
    "\n",
    "    conf_interval_s[criterion.name] = stats.norm.interval(0.95, loc=np.mean(task[\"s\"]), scale=stats.sem(task[\"s\"]))\n",
    "    conf_interval_f[criterion.name] = stats.norm.interval(0.95, loc=np.mean(task[\"f\"]), scale=stats.sem(task[\"f\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The final plot would be saved in `../test/test_files/agenteval-in-out/estimated_performance.png`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 695
    },
    "id": "zqa86vwgEGCT",
    "outputId": "248cd0bc-0927-4d9f-b911-088bd76acf5d"
   },
   "outputs": [],
   "source": [
    "# Create a bar plot with error bars for the average values of \"s\" and \"f\" for each criterion\n",
    "\n",
    "plt.figure(figsize=(12, 8))\n",
    "bar_width = 0.1\n",
    "index = np.arange(len(criteria))\n",
    "\n",
    "\n",
    "plt.bar(\n",
    "    index,\n",
    "    list(average_s.values()),\n",
    "    bar_width,\n",
    "    label=f\"success ({len(task['s'])} samples)\",\n",
    "    color=\"darkblue\",\n",
    "    yerr=[(avg - conf_interval_s[key][0]) for key, avg in average_s.items()],\n",
    "    capsize=5,\n",
    ")\n",
    "plt.bar(\n",
    "    index + bar_width,\n",
    "    list(average_f.values()),\n",
    "    bar_width,\n",
    "    label=f\"failed ({len(task['f'])} samples)\",\n",
    "    color=\"lightblue\",\n",
    "    yerr=[(avg - conf_interval_f[key][0]) for key, avg in average_f.items()],\n",
    "    capsize=5,\n",
    ")\n",
    "\n",
    "plt.xlabel(\"Criteria\", fontsize=16)\n",
    "plt.ylabel(\"Average Value\", fontsize=16)\n",
    "plt.title(\n",
    "    \"Average Values of 3 different baselines cases with 95% Confidence Intervals - math problems \", fontsize=12, pad=10\n",
    ")  # Adjust titlepad to move the title further above\n",
    "plt.xticks(index + bar_width / 2, [crit.name for crit in criteria], rotation=45, fontsize=14)\n",
    "plt.legend(loc=\"upper center\", fontsize=14, bbox_to_anchor=(0.5, 1), ncol=3)  # Adjust legend placement and ncol\n",
    "plt.tight_layout()  # Adjust subplot parameters to fit the labels\n",
    "plt.ylim(0, 5)\n",
    "plt.savefig(\"../test/test_files/agenteval-in-out/estimated_performance.png\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "front_matter": {
   "description": "AgentEval: a multi-agent system for assessing utility of LLM-powered applications",
   "tags": [
    "eval"
   ]
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "949777d72b0d2535278d3dc13498b2535136f6dfe0678499012e853ee9abcab1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
